Even Faster 65802 16x16 Multiply......Bob Sander-Cederlof

Bob Boughner, faithful reader from Yorktown, Virginia, decided that the challenge at the end of my article in the January 1986 AAL could not be ignored.  He was able to slightly increase the speed of my 16x16 multiply subroutine for the 65802.  After studying his code, I made a few more little changes and squeezed out even more cycles.

To see just how much faster the new subroutine is, I carefully counted the cycles, and then went back and did the same to January's subroutine.  For some reason I got a new answer for January's program, slightly slower than published.  Here are the results:

               Minimum  Maximum  Average
       January   333      693      513
       New One   321      633      477

The times include 6 cycles for a JSR to call the subroutine, and 6 cycles for the RTS to return.  By putting the code in-line, even these 12 cycles could be eliminated.  The so-called average time is merely the arithmetic average of the minimum and maximum times.  The "real" average for random factors will be faster, because one or both of the INC instructions at lines 1350 and 1430 would be skipped.  In fact, almost always at least one would be skipped, saving 48 cycles.  Note also that if the factor in CAND is zero, the total time is only 45 cycles.
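
The arithmetic behind the table can be checked with a short sketch.  Python is used here only as a calculator; the names TIMES, CALL_OVERHEAD, and summarize are made up for illustration, and the cycle counts are the ones in the table above.

```python
# Cycle counts copied from the table above (JSR and RTS included).
TIMES = {"January": (333, 693), "New One": (321, 633)}
CALL_OVERHEAD = 6 + 6  # JSR + RTS, removed if the code is placed in-line


def summarize(min_cycles, max_cycles):
    """Return the table's 'average' plus the in-line minimum and maximum."""
    return {
        "average": (min_cycles + max_cycles) // 2,
        "inline_min": min_cycles - CALL_OVERHEAD,
        "inline_max": max_cycles - CALL_OVERHEAD,
    }


for name, (lo, hi) in TIMES.items():
    print(name, summarize(lo, hi))
```

By this reckoning the new routine, placed in-line, would take between 309 and 621 cycles.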

In counting cycles I assumed that the low byte of the D-register, which tells the 65802 where the direct page is, is zero.  If it is non-zero, each reference to CAND, PLIER, and PROD requires one more cycle.

The new subroutine is only 4 bytes longer than the January one.  The new one uses the Y-register, while the old one did not.  There are three tricks in the new code which save time.  The first one is holding the multiplicand in the Y-register, so that TYA instructions can be used at lines 1310 and 1390.  This saves 2 cycles each time, or a total of 32 cycles in the maximum case.  The cost is the LDY CAND in line 1200, 4 cycles.
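
The trade-off for the first trick can be worked as a small sketch.  The count of 16 TYA executions in the maximum case is inferred from the 32-cycle figure above (32 / 2), and the function name is hypothetical.

```python
SAVED_PER_TYA = 2   # cycles saved each time a TYA is executed
LDY_COST = 4        # one-time cost of the LDY CAND


def net_saving(tya_count):
    """Net cycles saved by keeping the multiplicand in the Y-register."""
    return SAVED_PER_TYA * tya_count - LDY_COST


print(net_saving(16))  # maximum case: 32 - 4 = 28
print(net_saving(2))   # break-even point: 0
```

So the LDY pays for itself as soon as two TYA instructions have been executed, and a call that executes none is 4 cycles worse off.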

The second trick eliminates the CLC instruction before the multiplier is added in lines 1370-1430.  The savings is 16 cycles maximum, and the cost is 8 cycles to set it up in lines 1120-1140 by inverting the high byte of the multiplier.  This doesn't affect the average time any, but it does lower the maximum time.
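
The reason the average stays put is that the 8-cycle setup is paid on every call, while the CLC savings only appear when additions are performed.  A sketch of that bookkeeping, assuming the minimum case performs no additions and the maximum case saves the full 16 cycles (the variable names are only for illustration):

```python
SETUP_COST = 8        # inverting the high byte of the multiplier
MAX_CLC_SAVING = 16   # total CLC cycles removed in the maximum case

delta_min = SETUP_COST - 0                # minimum time rises by 8
delta_max = SETUP_COST - MAX_CLC_SAVING   # maximum time falls by 8
delta_average = (delta_min + delta_max) / 2

print(delta_min, delta_max, delta_average)  # 8 -8 0.0
```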

The third trick is at lines 1280 and 1290.  I saved 24 cycles by eliminating January's AND ##$0080 instruction here.  The LDA PLIER-1 instruction picks up the low byte of the multiplier in the high byte of the A-register, allowing me to see what bit 7 of the multiplier is without any masking or shifting.
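
The effect of loading from PLIER-1 can be modeled in a few lines.  This is only a sketch of the idea, not the subroutine itself: the address chosen for PLIER, the byte values, and the function names are all hypothetical, and the test against 0x8000 stands in for the sign flag that a 16-bit LDA sets without any extra instruction.

```python
def load16(memory, address):
    """Little-endian 16-bit load, like LDA with a 16-bit accumulator."""
    return memory[address] | (memory[address + 1] << 8)


def bit7_by_mask(memory, plier):
    """Masking approach: load the multiplier, then AND with $0080."""
    return (load16(memory, plier) & 0x0080) != 0


def bit7_by_sign(memory, plier):
    """Load from PLIER-1 so bit 7 of the multiplier lands in bit 15 of A."""
    return (load16(memory, plier - 1) & 0x8000) != 0


PLIER = 0x10               # hypothetical direct-page address
memory = bytearray(256)
memory[PLIER - 1] = 0x5A   # whatever precedes PLIER; it lands in the low byte of A
memory[PLIER] = 0x80       # low byte of the multiplier, bit 7 set
memory[PLIER + 1] = 0x12   # high byte of the multiplier

print(bit7_by_mask(memory, PLIER), bit7_by_sign(memory, PLIER))  # True True
```

The byte stored just below PLIER ends up in the low half of the accumulator, and the test ignores it.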
